Growing up, my parents always emphasized that a complete and happy life meant getting married. So, in this project, I use data to explore the sources of happiness for people of different marital statuses. I am particularly interested in whether staying unmarried, as an independent individual, means one cannot be happy.
The full dataset: HappyDB is a corpus of 100,000 crowd-sourced happy moments. You can read more about it at https://megagon.ai/happydb-a-happiness-database-of-100000-happy-moments/. I use only a subset of this dataset for the analysis in this report.
First, I join 'cleaned_hm.csv' and 'demographic.csv' into a dataframe called 'join_df1'. To make data handling easier down the line, I convert 'marital', 'country', and 'predicted_category' into numerical codes for basic data exploration. I then select 'marital', 'cleaned_hm' (which contains the happy-moment phrases), and 'predicted_category' to create a new dataframe called 'join_df2'.
In this collection of 100,000 crowd-sourced happy moments, single respondents contributed the most entries. Married respondents came second, followed by those who are divorced.
Looking solely at this corpus, single people appear to contribute the most happy moments, and married people also contribute a large share. However, those who have experienced less favorable marital outcomes, such as divorce, separation, or widowhood, contribute fewer happy moments to the corpus.
For single and widowed individuals, happiness often focuses on their achievements. On the other hand, for those who are married, divorced, or separated, happiness tends to focus on affection.
For married individuals, their happiness largely stems from their children, their partner, and their work.
When married people feel happy, what they say most often goes something like this:
For single individuals, their happiness largely stems from their friends and their work.
When single people feel happy, what they say most often goes something like this:
For divorced individuals, their happiness largely stems from their work and children.
When divorced people feel happy, what they say most often goes something like this:
For those who are separated, their moments of joy tend to center more around the company of friends and the value they place on their personal time.
When separated people feel happy, what they say most often goes something like this:
For those who are widowed, their moments of happiness often stem from their house and children.
When widowed people feel happy, what they say most often goes something like this:
Whether we are in a marital relationship or not, we can all find happiness; the sources of joy simply differ. Since single people contribute the most happy moments to this corpus, one can plausibly infer that a person can be very happy without ever marrying! At the same time, given that widowed, separated, and divorced individuals contribute fewer happy moments, it is a reasonable assumption that an unhappy marriage might reduce the frequency of joyful moments; note, though, that these raw counts also reflect how many respondents fall into each group.
And I want to say: Mom and Dad, I can be very happy on my own, too. I find a lot of joy in my friends and my work, and I lead a very fulfilling life.
import pandas as pd
CDF = pd.read_csv('/Users/janicemeng/Desktop/Project1/data/data/cleaned_hm.csv')
CDF.head()
| hmid | wid | reflection_period | original_hm | cleaned_hm | modified | num_sentence | ground_truth_category | predicted_category | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 27673 | 2053 | 24h | I went on a successful date with someone I fel... | I went on a successful date with someone I fel... | True | 1 | NaN | affection |
| 1 | 27674 | 2 | 24h | I was happy when my son got 90% marks in his e... | I was happy when my son got 90% marks in his e... | True | 1 | NaN | affection |
| 2 | 27675 | 1936 | 24h | I went to the gym this morning and did yoga. | I went to the gym this morning and did yoga. | True | 1 | NaN | exercise |
| 3 | 27676 | 206 | 24h | We had a serious talk with some friends of our... | We had a serious talk with some friends of our... | True | 2 | bonding | bonding |
| 4 | 27677 | 6227 | 24h | I went with grandchildren to butterfly display... | I went with grandchildren to butterfly display... | True | 1 | NaN | affection |
Demographic = pd.read_csv('/Users/janicemeng/Desktop/Project1/data/data/demographic.csv')
Demographic.head()
| wid | age | country | gender | marital | parenthood | |
|---|---|---|---|---|---|---|
| 0 | 1 | 37.0 | USA | m | married | y |
| 1 | 2 | 29.0 | IND | m | married | y |
| 2 | 3 | 25 | IND | m | single | n |
| 3 | 4 | 32 | USA | m | married | y |
| 4 | 5 | 29 | USA | m | married | y |
Senselabel = pd.read_csv('/Users/janicemeng/Desktop/Project1/data/data/senselabel.csv')
Senselabel.head()
| hmid | tokenOffset | word | lowercaseLemma | POS | MWE | offsetParent | supersenseLabel | |
|---|---|---|---|---|---|---|---|---|
| 0 | 31526 | 1 | I | i | PRON | O | 0 | NaN |
| 1 | 31526 | 2 | found | find | VERB | O | 0 | v.cognition |
| 2 | 31526 | 3 | a | a | DET | O | 0 | NaN |
| 3 | 31526 | 4 | silver | silver | ADJ | O | 0 | NaN |
| 4 | 31526 | 5 | coin | coin | NOUN | O | 0 | n.artifact |
join_df1 = pd.merge(CDF, Demographic, on='wid', how='inner')
join_df1.head()
| hmid | wid | reflection_period | original_hm | cleaned_hm | modified | num_sentence | ground_truth_category | predicted_category | age | country | gender | marital | parenthood | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27673 | 2053 | 24h | I went on a successful date with someone I fel... | I went on a successful date with someone I fel... | True | 1 | NaN | affection | 35 | USA | m | single | n |
| 1 | 27873 | 2053 | 24h | I played a new game that was fun and got to en... | I played a new game that was fun and got to en... | True | 1 | NaN | leisure | 35 | USA | m | single | n |
| 2 | 28073 | 2053 | 24h | I listened to some music and heard an entire a... | I listened to some music and heard an entire a... | True | 1 | NaN | leisure | 35 | USA | m | single | n |
| 3 | 33522 | 2053 | 24h | Went to see a movie with my friend | Went to see a movie with my friend | True | 1 | NaN | bonding | 35 | USA | m | single | n |
| 4 | 34522 | 2053 | 24h | Played guitar, learning a song on it | Played guitar, learning a song on it | True | 1 | NaN | leisure | 35 | USA | m | single | n |
join_df1.to_csv('/Users/janicemeng/Desktop/Project1/output_to_joined_csv.csv', index = False)
missing_values = join_df1.isnull().sum()
print(missing_values)
hmid                         0
wid                          0
reflection_period            0
original_hm                  0
cleaned_hm                   0
modified                     0
num_sentence                 0
ground_truth_category    86410
predicted_category           0
age                         93
country                    203
gender                      79
marital                    157
parenthood                  78
dtype: int64
join_df1 = join_df1.dropna()
missing_values = join_df1.isnull().sum()
print(missing_values)
hmid                     0
wid                      0
reflection_period        0
original_hm              0
cleaned_hm               0
modified                 0
num_sentence             0
ground_truth_category    0
predicted_category       0
age                      0
country                  0
gender                   0
marital                  0
parenthood               0
dtype: int64
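A note on the cleaning step above: since 'ground_truth_category' is missing for 86,410 rows, a blanket dropna() throws all of those rows away along with the few hundred rows missing demographic values. A minimal sketch (using a toy frame in place of join_df1) of the gentler alternative of dropping the sparse column first:

```python
import pandas as pd
import numpy as np

# Toy stand-in for join_df1: one mostly-empty column, one mostly-complete column
df = pd.DataFrame({
    'ground_truth_category': [np.nan, np.nan, 'bonding', np.nan],
    'age': [35.0, np.nan, 25.0, 29.0],
})

# Blanket dropna keeps only rows that are complete in every column
strict = df.dropna()

# Dropping the sparse column first preserves far more rows
lenient = df.drop(columns=['ground_truth_category']).dropna()

print(len(strict), len(lenient))  # 1 3
```

The trade-off: keeping more rows means losing the human-labeled categories, so this report's use of dropna() is defensible only because it relies on 'predicted_category' anyway.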
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import io
import re
import string
import tqdm
import numpy as np
import multiprocessing
from gensim.models import Word2Vec
print(join_df1.columns)
Index(['hmid', 'wid', 'reflection_period', 'original_hm', 'cleaned_hm',
'modified', 'num_sentence', 'ground_truth_category',
'predicted_category', 'age', 'country', 'gender', 'marital',
'parenthood'],
dtype='object')
#Convert the 'predicted_category' column to a categorical dtype, build a dictionary mapping numerical codes to their category labels, and add a new 'predicted_category_code' column holding each row's code.
join_df1['predicted_category'] = join_df1['predicted_category'].astype('category')
category_to_code = dict(enumerate(join_df1['predicted_category'].cat.categories))
print("Category to Code:", category_to_code)
join_df1['predicted_category_code'] = join_df1['predicted_category'].cat.codes
Category to Code: {0: 'achievement', 1: 'affection', 2: 'bonding', 3: 'enjoy_the_moment', 4: 'exercise', 5: 'leisure', 6: 'nature'}
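Note that dict(enumerate(...)) maps codes back to labels, not labels to codes, which is handy for decoding. A toy sketch of the round trip with pandas categoricals:

```python
import pandas as pd

# Small series standing in for the 'predicted_category' column
s = pd.Series(['affection', 'bonding', 'affection']).astype('category')

code_to_label = dict(enumerate(s.cat.categories))  # e.g. {0: 'affection', 1: 'bonding'}
codes = s.cat.codes                                # numeric code per row

# Decode the numeric codes back to the original labels
decoded = codes.map(code_to_label)
print(list(codes), list(decoded))  # [0, 1, 0] ['affection', 'bonding', 'affection']
```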
sns.countplot(x='predicted_category_code',data=join_df1, palette='pastel')
<Axes: xlabel='predicted_category_code', ylabel='count'>
#In the HappyDB corpus, 'affection' is the most frequent predicted category, followed by 'achievement'. The third most common category is 'enjoy_the_moment'.
join_df1['marital'] = join_df1['marital'].astype('category')
marital_to_code = dict(enumerate(join_df1['marital'].cat.categories))
print("marital to Code:", marital_to_code)
join_df1['marital_code'] = join_df1['marital'].cat.codes
marital to Code: {0: 'divorced', 1: 'married', 2: 'separated', 3: 'single', 4: 'widowed'}
join_df1['country'] = join_df1['country'].astype('category')
country_to_code = dict(enumerate(join_df1['country'].cat.categories))
print("country to Code:", country_to_code)
join_df1['country_code'] = join_df1['country'].cat.codes
country to Code: {0: 'AFG', 1: 'ALB', 2: 'ARE', 3: 'ARG', 4: 'ARM', 5: 'ASM', 6: 'AUS', 7: 'AUT', 8: 'BEL', 9: 'BGD', 10: 'BGR', 11: 'BHS', 12: 'BRA', 13: 'CAN', 14: 'CHL', 15: 'COL', 16: 'CRI', 17: 'CYP', 18: 'CZE', 19: 'DEU', 20: 'DNK', 21: 'DOM', 22: 'DZA', 23: 'ECU', 24: 'EGY', 25: 'ESP', 26: 'EST', 27: 'ETH', 28: 'FIN', 29: 'FRA', 30: 'GBR', 31: 'GHA', 32: 'GMB', 33: 'GRC', 34: 'GTM', 35: 'HRV', 36: 'IDN', 37: 'IND', 38: 'IRL', 39: 'ISL', 40: 'ITA', 41: 'JAM', 42: 'JPN', 43: 'KEN', 44: 'KNA', 45: 'KWT', 46: 'LKA', 47: 'LTU', 48: 'LVA', 49: 'MAC', 50: 'MDA', 51: 'MEX', 52: 'MKD', 53: 'MLT', 54: 'MYS', 55: 'NGA', 56: 'NIC', 57: 'NLD', 58: 'NOR', 59: 'NZL', 60: 'PAK', 61: 'PER', 62: 'PHL', 63: 'POL', 64: 'PRI', 65: 'PRT', 66: 'ROU', 67: 'RUS', 68: 'SGP', 69: 'SRB', 70: 'SVN', 71: 'SWE', 72: 'TCA', 73: 'THA', 74: 'TTO', 75: 'TUR', 76: 'TWN', 77: 'UGA', 78: 'UMI', 79: 'URY', 80: 'USA', 81: 'VEN', 82: 'VNM', 83: 'ZAF'}
join_df1['parenthood'] = join_df1['parenthood'].astype('category')
parenthood_to_code = dict(enumerate(join_df1['parenthood'].cat.categories))
print("parenthood to Code:", parenthood_to_code)
join_df1['parenthood_code'] = join_df1['parenthood'].cat.codes
parenthood to Code: {0: 'n', 1: 'y'}
join_df1['gender'] = join_df1['gender'].astype('category')
gender_to_code = dict(enumerate(join_df1['gender'].cat.categories))
print("gender to Code:", gender_to_code)
join_df1['gender_code'] = join_df1['gender'].cat.codes
gender to Code: {0: 'f', 1: 'm', 2: 'o'}
join_df1 = join_df1.drop(['reflection_period', 'original_hm', 'modified', 'num_sentence', 'ground_truth_category'], axis=1)
number_of_rows = join_df1.shape[0]
number_of_rows
14055
#.copy() creates an independent frame, avoiding SettingWithCopyWarning when new columns are added later
join_df2 = join_df1[['marital','marital_code','cleaned_hm','predicted_category','predicted_category_code','gender']].copy()
join_df2
| marital | marital_code | cleaned_hm | predicted_category | predicted_category_code | gender | |
|---|---|---|---|---|---|---|
| 6 | single | 3 | I played a game for about half an hour. | leisure | 5 | m |
| 15 | married | 1 | When my family plan a abroad tour with me | affection | 1 | m |
| 19 | married | 1 | When my house ready to live with my family | affection | 1 | m |
| 23 | married | 1 | When my friend meet me today with expensive gi... | bonding | 2 | m |
| 25 | married | 1 | I was very happy when my son playing with whol... | affection | 1 | m |
| ... | ... | ... | ... | ... | ... | ... |
| 100494 | married | 1 | My tooth stopped aching after my dentist visit. | achievement | 0 | f |
| 100496 | married | 1 | I took a bath with my husband. | affection | 1 | f |
| 100526 | married | 1 | I got on the scales in the morning and I was 5... | achievement | 0 | f |
| 100529 | married | 1 | Quite dinner with my wife. | affection | 1 | m |
| 100533 | married | 1 | Yesterday evening I received a call from unkno... | bonding | 2 | m |
14055 rows × 6 columns
sns.countplot(x='marital',data=join_df2, palette='pastel')
<Axes: xlabel='marital', ylabel='count'>
marital_counts = join_df2['marital'].value_counts()
print(marital_counts)
single       7409
married      6040
divorced      465
separated      84
widowed        57
Name: marital, dtype: int64
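These raw counts mostly reflect how many respondents of each marital status took part, not how happy each person is. A hedged sketch (toy data; the 'wid' and 'marital' column names follow join_df1) of normalizing to moments per respondent:

```python
import pandas as pd

# Toy stand-in for join_df1: one row per happy moment, with worker id and status
df = pd.DataFrame({
    'wid':     [1, 1, 1, 2, 2, 3],
    'marital': ['single', 'single', 'single', 'married', 'married', 'married'],
})

moments = df.groupby('marital')['wid'].count()    # happy moments per group
workers = df.groupby('marital')['wid'].nunique()  # distinct respondents per group
per_person = moments / workers                    # moments per respondent

print(per_person.to_dict())  # {'married': 1.5, 'single': 3.0}
```

On the real data this would show whether single respondents report more moments each, or are simply more numerous.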
ax = sns.countplot(data=join_df2, x='predicted_category', hue='marital')
plt.title('Predicted Category Distribution by Marital Status')
plt.xlabel('Predicted Category')
plt.ylabel('Count')
plt.legend(title='Marital Status')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
[Text(0, 0, 'achievement'), Text(1, 0, 'affection'), Text(2, 0, 'bonding'), Text(3, 0, 'enjoy_the_moment'), Text(4, 0, 'exercise'), Text(5, 0, 'leisure'), Text(6, 0, 'nature')]
#For single and widowed individuals, happiness is often derived more from their achievements. On the other hand, for those who are married, divorced, or separated, happiness tends to come more from affection
join_df3 = join_df1[['gender','cleaned_hm','predicted_category']]
join_df3.head()
| gender | cleaned_hm | predicted_category | |
|---|---|---|---|
| 6 | m | I played a game for about half an hour. | leisure |
| 15 | m | When my family plan a abroad tour with me | affection |
| 19 | m | When my house ready to live with my family | affection |
| 23 | m | When my friend meet me today with expensive gi... | bonding |
| 25 | m | I was very happy when my son playing with whol... | affection |
sns.countplot(x='gender',data=join_df3, palette='pastel')
<Axes: xlabel='gender', ylabel='count'>
join_df4 = join_df1[['country','cleaned_hm']]
Top_Five_Countries = join_df4['country'].value_counts().head(5)
Top_Five_Countries
USA    10477
IND     2954
VEN       71
CAN       70
GBR       54
Name: country, dtype: int64
!pip install wordcloud
import plotly.express as px
from urllib import request
from wordcloud import WordCloud, STOPWORDS
import csv
import bs4
from tqdm.notebook import trange, tqdm
#Splitting each word, removing stopwords and punctuation
join_df2.loc[:, 'words'] = join_df2['cleaned_hm'].str.split()
join_df2['words']
6 [I, played, a, game, for, about, half, an, hour.]
15 [When, my, family, plan, a, abroad, tour, with...
19 [When, my, house, ready, to, live, with, my, f...
23 [When, my, friend, meet, me, today, with, expe...
25 [I, was, very, happy, when, my, son, playing, ...
...
100494 [My, tooth, stopped, aching, after, my, dentis...
100496 [I, took, a, bath, with, my, husband.]
100526 [I, got, on, the, scales, in, the, morning, an...
100529 [Quite, dinner, with, my, wife.]
100533 [Yesterday, evening, I, received, a, call, fro...
Name: words, Length: 14055, dtype: object
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/janicemeng/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
custom_stop_words = ['got', 'day', 'went', 'one', 'today', 'happy','made','able','found','month','new']
stop_words.update(custom_stop_words)
#stop_words
def remove_stopwords(word_list):
return [word for word in word_list if word.lower() not in stop_words]
join_df2.loc[:, 'words_joined'] = join_df2['words'].apply(remove_stopwords)
join_df2.head()
| marital | marital_code | cleaned_hm | predicted_category | predicted_category_code | gender | words | words_joined | |
|---|---|---|---|---|---|---|---|---|
| 6 | single | 3 | I played a game for about half an hour. | leisure | 5 | m | [I, played, a, game, for, about, half, an, hour.] | [played, game, half, hour.] |
| 15 | married | 1 | When my family plan a abroad tour with me | affection | 1 | m | [When, my, family, plan, a, abroad, tour, with... | [family, plan, abroad, tour] |
| 19 | married | 1 | When my house ready to live with my family | affection | 1 | m | [When, my, house, ready, to, live, with, my, f... | [house, ready, live, family] |
| 23 | married | 1 | When my friend meet me today with expensive gi... | bonding | 2 | m | [When, my, friend, meet, me, today, with, expe... | [friend, meet, expensive, gift] |
| 25 | married | 1 | I was very happy when my son playing with whol... | affection | 1 | m | [I, was, very, happy, when, my, son, playing, ... | [son, playing, whole] |
join_df2.loc[:, 'words_joined'] = join_df2['words_joined'].apply(' '.join)
join_df2.head()
| marital | marital_code | cleaned_hm | predicted_category | predicted_category_code | gender | words | words_joined | |
|---|---|---|---|---|---|---|---|---|
| 6 | single | 3 | I played a game for about half an hour. | leisure | 5 | m | [I, played, a, game, for, about, half, an, hour.] | played game half hour. |
| 15 | married | 1 | When my family plan a abroad tour with me | affection | 1 | m | [When, my, family, plan, a, abroad, tour, with... | family plan abroad tour |
| 19 | married | 1 | When my house ready to live with my family | affection | 1 | m | [When, my, house, ready, to, live, with, my, f... | house ready live family |
| 23 | married | 1 | When my friend meet me today with expensive gi... | bonding | 2 | m | [When, my, friend, meet, me, today, with, expe... | friend meet expensive gift |
| 25 | married | 1 | I was very happy when my son playing with whol... | affection | 1 | m | [I, was, very, happy, when, my, son, playing, ... | son playing whole |
def preprocessing_text(df):
    #Replace punctuation with hyphens, protect spaces, then restore single spaces between words
    df.words_joined = df.words_joined.str.replace(r'[^\w\s]', '-', regex=True)
    df.words_joined = df.words_joined.str.replace(' ', '_')
    df.words_joined = df.words_joined.str.replace('-_-', '')
    df.words_joined = df.words_joined.str.replace('-', ' ')
    return df
#'words_joined' already holds plain strings at this point, so a single join suffices
text = ' '.join(join_df2['words_joined'].dropna())
married_df = join_df2[join_df2['marital'] == 'married']
single_df = join_df2[join_df2['marital'] == 'single']
divorced_df = join_df2[join_df2['marital'] == 'divorced']
seperated_df = join_df2[join_df2['marital'] == 'separated']
widowed_df = join_df2[join_df2['marital'] == 'widowed']
married_df.head()
| marital | marital_code | cleaned_hm | predicted_category | predicted_category_code | gender | words | words_joined | |
|---|---|---|---|---|---|---|---|---|
| 15 | married | 1 | When my family plan a abroad tour with me | affection | 1 | m | [When, my, family, plan, a, abroad, tour, with... | family plan abroad tour |
| 19 | married | 1 | When my house ready to live with my family | affection | 1 | m | [When, my, house, ready, to, live, with, my, f... | house ready live family |
| 23 | married | 1 | When my friend meet me today with expensive gi... | bonding | 2 | m | [When, my, friend, meet, me, today, with, expe... | friend meet expensive gift |
| 25 | married | 1 | I was very happy when my son playing with whol... | affection | 1 | m | [I, was, very, happy, when, my, son, playing, ... | son playing whole |
| 33 | married | 1 | When I shifted my new home | achievement | 0 | m | [When, I, shifted, my, new, home] | shifted home |
#Split the phrases into individual words, creating a list for each entry
words_split = married_df['words_joined'].str.split().tolist()
#Flatten these lists into a single list containing all words
all_words = [word for sublist in words_split for word in sublist]
#Use collections.Counter to count each word's occurrences
from collections import Counter
word_counts = Counter(all_words)
top_15_words = word_counts.most_common(15)
print(top_15_words)
[('time', 459), ('wife', 364), ('husband', 340), ('son', 326), ('work', 311), ('last', 309), ('good', 303), ('family', 297), ('daughter', 294), ('happy.', 282), ('first', 252), ('really', 252), ('get', 229), ('home', 222), ('old', 214)]
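The split-flatten-count pattern used here is repeated for every marital group below; it could be wrapped once in a small helper (a sketch, assuming the same 'words_joined' column of space-joined strings):

```python
from collections import Counter
import pandas as pd

def top_words(df, column='words_joined', n=15):
    """Return the n most common words across a column of space-joined text."""
    all_words = [w for text in df[column].dropna() for w in text.split()]
    return Counter(all_words).most_common(n)

# Toy usage on a frame shaped like the per-group dataframes
demo = pd.DataFrame({'words_joined': ['wife dinner', 'wife son', 'son son']})
print(top_words(demo, n=2))  # [('son', 3), ('wife', 2)]
```

With this helper, each group's top-15 list becomes a one-liner such as top_words(married_df).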
join_df2.columns
Index(['marital', 'marital_code', 'cleaned_hm', 'predicted_category',
'predicted_category_code', 'gender', 'words', 'words_joined'],
dtype='object')
#Create a unified text block from all the individual phrases or words
married_text = " ".join(hm for hm in married_df['words_joined'].dropna())
single_text = " ".join(hm for hm in single_df['words_joined'].dropna())
divorced_text = " ".join(hm for hm in divorced_df['words_joined'].dropna())
seperated_text = " ".join(hm for hm in seperated_df['words_joined'].dropna())
widowed_text = " ".join(hm for hm in widowed_df['words_joined'].dropna())
#create wordcloud
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
stopwords = set(),
min_font_size = 10).generate(married_text)
plt.figure(figsize = (3, 5), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
#Finds and gathers sentences that include common words
high_freq_words = ['husband', 'wife', 'work','daughter','son','time','new']
sentences_with_high_freq_words = {word: [] for word in high_freq_words}
for index, row in married_df.iterrows():
for word in high_freq_words:
if word in row['cleaned_hm'].lower():
sentences_with_high_freq_words[word].append(row['cleaned_hm'])
#The full output is quite lengthy, so the print loop below is commented out. Remove the comment markers to see the full results.
#for word, sentences in sentences_with_high_freq_words.items():
# print(f"Sentences containing the word '{word}':")
# for sentence in sentences:
# print(f"- {sentence}")
# print("\n")
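One caveat about the matching above: the substring test `word in row['cleaned_hm'].lower()` also fires on words embedded in longer ones ('son' matches 'season' or 'reason'). A word-boundary regex avoids this (a small sketch):

```python
import re

sentences = [
    "My son passed his exam.",
    "It was a good season for hiking.",
]

# \b anchors the pattern to whole words, so 'season' no longer matches
pattern = re.compile(r'\bson\b', re.IGNORECASE)
matches = [s for s in sentences if pattern.search(s)]
print(matches)  # only the first sentence matches
```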
single_df = join_df2[join_df2['marital'] == 'single']
words_split = single_df['words_joined'].str.split().tolist()
all_words = [word for sublist in words_split for word in sublist]
word_counts = Counter(all_words)
top_15_words = word_counts.most_common(15)
print(top_15_words)
[('time', 469), ('friend', 438), ('really', 406), ('work', 396), ('good', 351), ('last', 331), ('friends', 300), ('first', 284), ('get', 278), ('happy.', 255), ('felt', 230), ('finally', 228), ('see', 220), ('came', 215), ('bought', 211)]
single_text = " ".join(hm for hm in single_df['words_joined'].dropna())
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
stopwords = set(),
min_font_size = 10).generate(single_text)
plt.figure(figsize = (3, 4), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
high_freq_words = ['work', 'friend', 'found']
sentences_with_high_freq_words = {word: [] for word in high_freq_words}
for index, row in single_df.iterrows():
for word in high_freq_words:
if word in row['cleaned_hm'].lower():
sentences_with_high_freq_words[word].append(row['cleaned_hm'])
#for word, sentences in sentences_with_high_freq_words.items():
# print(f"Sentences containing the word '{word}':")
# for sentence in sentences:
# print(f"- {sentence}")
# print("\n")
divorced_df = join_df2[join_df2['marital'] == 'divorced']
divorced_df.head()
| marital | marital_code | cleaned_hm | predicted_category | predicted_category_code | gender | words | words_joined | |
|---|---|---|---|---|---|---|---|---|
| 158 | divorced | 0 | I made vacation plans with my daughter today f... | affection | 1 | f | [I, made, vacation, plans, with, my, daughter,... | vacation plans daughter Florida July. |
| 1152 | divorced | 0 | I picked my daughter up from the airport and w... | affection | 1 | f | [I, picked, my, daughter, up, from, the, airpo... | picked daughter airport fun good conversation ... |
| 2411 | divorced | 0 | I had the weekly high score in an online game ... | bonding | 2 | m | [I, had, the, weekly, high, score, in, an, onl... | weekly high score online game play friends. |
| 2430 | divorced | 0 | I met some friends for dinner and drinks follo... | bonding | 2 | m | [I, met, some, friends, for, dinner, and, drin... | met friends dinner drinks followed watching li... |
| 2434 | divorced | 0 | My teenage daughter finished her hardest exams... | affection | 1 | m | [My, teenage, daughter, finished, her, hardest... | teenage daughter finished hardest exams comple... |
words_split = divorced_df['words_joined'].str.split().tolist()
all_words = [word for sublist in words_split for word in sublist]
word_counts = Counter(all_words)
top_15_words = word_counts.most_common(15)
print(top_15_words)
[('work', 37), ('good', 26), ('daughter', 25), ('friend', 22), ('get', 22), ('time', 21), ('really', 20), ('first', 19), ('finally', 18), ('last', 17), ('see', 17), ('son', 17), ('happy.', 16), ('dinner', 15), ('came', 15)]
divorced_text = " ".join(hm for hm in divorced_df['words_joined'].dropna())
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
stopwords = set(),
min_font_size = 10).generate(divorced_text)
plt.figure(figsize = (3, 4), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
high_freq_words = ['work', 'son', 'daughter']
sentences_with_high_freq_words = {word: [] for word in high_freq_words}
for index, row in divorced_df.iterrows():
for word in high_freq_words:
if word in row['cleaned_hm'].lower():
sentences_with_high_freq_words[word].append(row['cleaned_hm'])
'''
for word, sentences in sentences_with_high_freq_words.items():
print(f"Sentences containing the word '{word}':")
for sentence in sentences:
print(f"- {sentence}")
print("\n")
'''
separated_df= join_df2[join_df2['marital'] == 'separated']
words_split = separated_df['words_joined'].str.split().tolist()
all_words = [word for sublist in words_split for word in sublist]
word_counts = Counter(all_words)
top_15_words = word_counts.most_common(15)
print(top_15_words)
[('friend', 10), ('time', 8), ('good', 8), ('go', 8), ('last', 6), ('get', 6), ('daughter', 5), ('really', 5), ('old', 4), ('watched', 4), ('talked', 4), ('free', 4), ('best', 4), ('son', 4), ('first', 4)]
separated_text = " ".join(hm for hm in separated_df['words_joined'].dropna())
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
stopwords = set(),
min_font_size = 10).generate(separated_text)
plt.figure(figsize = (3, 4), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
high_freq_words = ['friend', 'good', 'time']
sentences_with_high_freq_words = {word: [] for word in high_freq_words}
for index, row in separated_df.iterrows():
for word in high_freq_words:
if word in row['cleaned_hm'].lower():
sentences_with_high_freq_words[word].append(row['cleaned_hm'])
'''
for word, sentences in sentences_with_high_freq_words.items():
print(f"Sentences containing the word '{word}':")
for sentence in sentences:
print(f"- {sentence}")
print("\n")
'''
widowed_df = join_df2[join_df2['marital'] == 'widowed']
words_split = widowed_df['words_joined'].str.split().tolist()
all_words = [word for sublist in words_split for word in sublist]
word_counts = Counter(all_words)
top_15_words = word_counts.most_common(15)
print(top_15_words)
[('house', 5), ('son', 4), ('friend', 4), ('took', 4), ('bought', 3), ('two', 3), ('months', 3), ('wanted', 3), ('walked', 3), ('finally', 3), ('get', 3), ('really', 2), ('great.', 2), ('Seeing', 2), ('daughter', 2)]
widowed_text = " ".join(hm for hm in widowed_df['words_joined'].dropna())
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
stopwords = set(),
min_font_size = 10).generate(widowed_text)
plt.figure(figsize = (3, 4), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
high_freq_words = ['house', 'daughter', 'son']
sentences_with_high_freq_words = {word: [] for word in high_freq_words}
for index, row in widowed_df.iterrows():
for word in high_freq_words:
if word in row['cleaned_hm'].lower():
sentences_with_high_freq_words[word].append(row['cleaned_hm'])
'''
for word, sentences in sentences_with_high_freq_words.items():
print(f"Sentences containing the word '{word}':")
for sentence in sentences:
print(f"- {sentence}")
print("\n")
'''